Search with joint image-audio queries

ABSTRACT

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing joint image-audio queries. In one aspect, a method includes receiving, from a client device, a joint image-audio query including query image data and query audio data. Query image feature data is determined from the query image data. Query audio feature data is determined from the query audio data. The query image feature data and the query audio feature data are provided to a joint image-audio relevance model trained to generate relevance scores for a plurality of resources, each resource including resource image data defining a resource image for the resource and text data defining resource text for the resource. Each relevance score is a measure of the relevance of the corresponding resource to the joint image-audio query. The resources are ordered according to the relevance scores, and data defining search results indicating the order of the resources is provided to the client device.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation application of, and claims priority to, U.S. patent application Ser. No. 12/914,653, now U.S. Pat. No. 8,788,434, entitled “Search with Joint Image-Audio Queries”, filed on Oct. 28, 2010. The disclosure of the foregoing application is incorporated herein by reference in its entirety for all purposes.

BACKGROUND

This specification relates to processing queries, particularly to queries including both an image and associated audio.

The Internet provides access to a wide variety of resources, for example, video files, image files, audio files, or Web pages including content for particular subjects, book articles, or consumer products. A search system can select one or more resources in response to receiving a search query. A search query is data that a user submits to a search engine to satisfy the user's informational needs. The search system selects and scores resources based on their relevance to the search query. The search results are typically ordered according to the scores, and provided in a search results page.

To search image resources, a search system can determine the relevance of an image to a text query based on the textual content of the resource in which the image is located and also based on relevance feedback associated with the image. Some search systems search image resources by using query images as input. A query image is an image, such as a JPEG file, that is used by a search engine as input to a search processing operation. Related images can be found by processing other images and identifying images that are similar in visual appearance to the query image. The use of query images is becoming much more prevalent with the advent of smart phones that include cameras. For example, using a smart phone, a user can now take a picture of a subject of interest and submit the picture to a search engine. The search engine then searches image resources using the picture as a query image.

However, viewers interpret images in a much more subjective manner than text. Thus, while the images that are identified may be similar in appearance to the query image, many of the images may not be of interest to the viewer. For example, a user may conduct a search on an image of a car. The user may be interested in other cars of that brand, but an image search based only on visual similarity might return images of cars of different brands.

SUMMARY

In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving a joint image-audio query from a client device, the joint image-audio query including query image data defining a query image and query audio data defining query audio; determining query image feature data from the query image data, the query image feature data describing image features of the query image; determining query audio feature data from the audio data, the query audio feature data describing audio features of the query audio; providing the query image feature data and the query audio feature data to a joint image-audio relevance model, the joint image-audio relevance model trained to generate relevance scores for a plurality of resources, wherein each resource includes resource image data defining a resource image for the resource and text data defining resource text for the resource, and wherein each relevance score is a measure of the relevance of the corresponding resource to the joint image-audio query; ordering the resources according to the corresponding relevance scores; and providing data defining search results indicating the order of the resources to the client device. Other embodiments of this aspect include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices.

Another aspect of the subject matter described in this specification can be implemented in methods that include the actions of accessing image annotation data describing a plurality of annotation pairs, each annotation pair including image data defining an image and text data associated with the image; accessing resources, each resource including resource image data defining a resource image for the resource and text data defining resource text for the resource; and training a joint image-audio relevance model on the image annotation data and the resources to generate relevance scores for a plurality of resources, wherein each relevance score is a measure of the relevance of a corresponding resource to a joint image-audio query that includes query image data defining a query image and query audio data defining query audio. Other embodiments of this aspect include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices.

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. Adding audio data to an image query can improve the relevance of search results for the query. Relevance can be improved both by providing information that can aid the system in extracting the object of interest in an image, and also by providing information that supplements the user's search beyond what can be found in the image. This information can also be added in various other ways. In some embodiments, a portion of the image can be selected as containing the object of interest by the user drawing a circle on the image using a touch screen. The user can also outline the object of interest more closely than a circle or other shape, and can also draw the outline using other input methods. In some embodiments, the user can add additional information regarding the image using a dropdown menu box. The menu box can have different categories of items, such as shopping categories including shoes, shirts, pants, and other similar categories.

The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an example environment in which a joint image-audio search system provides search services.

FIGS. 2A and 2B are example images for a joint image-audio query.

FIG. 3A is a block diagram of an example process for querying a joint image-audio search system.

FIG. 3B is a block diagram of an example process for training a joint image-audio relevance model.

FIG. 4 is a flow chart of an example process for training a joint image-audio relevance model.

FIG. 5 is a flow chart of an example process for ranking resources for a joint image-audio query.

DETAILED DESCRIPTION

§1.0 Overview

An application running on a mobile phone allows a user to take a picture of an object and speak into the phone to record the user's speech. The audio recording is paired with the image to form a joint image-audio query. The mobile device then submits the joint image-audio query to a search system.

The search system receives the joint image-audio query, determines text data from the speech, and generates image feature data from the image. The search system uses the text data and the image feature data as inputs into a joint image-audio relevance model, which compares resources to the input data. The resources can be any of the resources found on the Internet, including web pages, documents, images, and video. As one example, each resource can be a document for a product, which includes an image of the product and associated text data of the product. The joint image-audio relevance model compares the query image feature data to the image feature data of each resource and the query text data to the corresponding resource text data, and computes a relevance score for each resource. The system orders the resources according to the relevance scores and presents search results to the user. The search results include links to the ordered resources, and may also include additional information about each resource, for example, thumbnails of the resource image or subsets of the resource text.

For the joint image-audio relevance model to determine the relevance of a joint image-audio query to the resources, the model is first trained. Training the model involves using image annotation data, which are annotation pairs. Each annotation pair is an image paired with text data associated with the image. These annotation pairs are used as training inputs to the joint image-audio relevance model, along with training and testing resources for the annotation pairs. The joint image-audio relevance model is trained until the testing resources are ranked in a manner that is deemed to be acceptable, as defined by one or more criteria.

FIG. 1 is a block diagram of an example environment 100 in which a joint image-audio search system 106 provides search services. The example environment 100 includes a network 104, such as the Internet, connecting a user device 102 to a search system 106. The user device 102 transmits a joint image-audio query 120 that includes a pairing of image 122 and audio 124 data over the network 104 to the search system 106. Example audio 124 is a speech recording. The system 106 processes the image 122 and audio 124 data and compares them to a collection of resources 116, computing a relevance score for each resource 116. The system 106 ranks these resources 116 by their relevance scores and sends a list of search results, each of which includes a resource link 130 to a corresponding resource, to the user device 102.

The user device 102 is an electronic device that is under the control of a user and is capable of requesting and receiving resources 116 over the network 104. Example user devices 102 include personal computers, mobile communication devices, and other devices that can send and receive data over the network. A user device 102 typically includes a user application, e.g., a web browser, to facilitate the sending and receiving of data over the network 104. The user device 102 may also include a camera and a microphone for acquiring an image 122 and audio 124. The user device also includes an application that pairs the audio 124 with the image 122 to form a joint image-audio query. The query audio 124 typically includes speech data that provides more information about the image 122 or about the user's search parameters.

By way of example, assume a user is searching for a water bottle, and the query image 122 is a picture of a water bottle taken by the user device 102. Refer, for example, to FIG. 2A. In FIG. 2A, the image may include more than just the water bottle. After taking the image, the user specifies that the water bottle is the object of interest in the picture by augmenting the query image 122 with the query audio 124, “water bottle.” Alternatively, the user may provide more specific information, for example, by including “red water bottle” as the query audio 124. The query audio 124 may also include positional information. For example, if there is more than one object in the query image 122, the user may specify the object of interest by submitting the query audio 124, “red bottle on the right.”

Even with a single object in the picture, audio information may improve the results of a search. For example, FIG. 2B contains only the water bottle in the picture. However, if a search were conducted to find similar images based on the visual features alone, the results may include only bottles that have a similar shape and color, and may not include other types of water bottles. By augmenting the image with the audio, e.g., “water bottle” or “water bottle for bicycle rides,” the user provides additional information to the search system, and the search system uses this additional information to provide search results that are likely to satisfy the user's informational needs.

Further, the user may also provide parameters by use of audio to restrict the search results. For example, the user may be searching a product database to find a water bottle for purchase. The user may provide to the search system the image 122 of the water bottle and the query audio 124, “Brand X water bottle under ten dollars,” or, as another example, “this water bottle in blue.”

Referring back to FIG. 1, the search system 106 receives the joint image-audio query that includes the image 122 data and the audio 124 data from the user device 102 through the network 104. In some implementations, the search system 106 includes an image processing apparatus 110 to generate image feature data from the image 122 data. Alternatively, in other implementations, the search system passes the image 122 data to a separate image processing apparatus 110 and receives the image feature data from the separate image processing apparatus 110. Similarly, the search system 106 may also include a speech processing apparatus 112 to extract text data from the audio 124 data, or it may pass the audio 124 data to a separate speech processing apparatus 112 and receive the text data.

The search system 106 uses the image feature data and the text data derived from the joint image-audio query as input to a joint image-audio relevance model 108. The joint image-audio relevance model 108 receives these two inputs and also receives resources 116. The joint image-audio relevance model 108 scores each resource 116, with each score indicating a measure of relevance of the resource 116 to the joint image-audio query.

In some implementations, the search system, using the joint image-audio relevance model 108, computes a score for each resource according to the following ranking function:

REL_i = f(S, I, R_i)

where

REL_i is the relevance score for a resource R_i;

S is the audio data 124;

I is the image data 122; and

R_i is a given resource in a resource database or cache. The function f(S, I, R) is described in more detail with respect to FIG. 3B below.
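As a minimal illustrative sketch (the function and variable names below are not from this specification), the ranking step amounts to applying f to every candidate resource and sorting by the resulting score:

```python
def rank_by_relevance(f, audio_text, image_features, resources):
    """Apply the ranking function REL_i = f(S, I, R_i) to every candidate resource.

    f              -- scoring callable f(S, I, R) returning a relevance score
    audio_text     -- S, the text derived from the query audio
    image_features -- I, the feature data derived from the query image
    resources      -- iterable of candidate resources R_i
    Returns (score, resource) pairs ordered from most to least relevant.
    """
    scored = [(f(audio_text, image_features, resource), resource) for resource in resources]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```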

A resource 116 is any data that can be provided over a network 104 and is associated with a resource address or indexed in a database. In some implementations, a resource database 114 comprises a collection of resources 116, with each resource 116 including a resource image and resource text. One example of a resource database 114 is a product database that includes product documents comprising an image of a product and data describing the product, such as brand name, price, and a textual description.

For each ith resource, the search system 106 determines resource image feature data from the resource image in a manner similar to how it determines query image feature data from the query image. The search system 106 also determines resource text data from the resource 116. The joint image-audio relevance model 108 then compares the query image feature data to the resource image feature data and the query text data to the resource text data of a resource 116, and computes a relevance score REL_i for the resource 116. The model 108 provides the relevance scores to the search system 106. The search system 106 then orders the resources according to the relevance scores, and provides search results 130, ranked by the relevance scores of the resources, to the user device 102.

§2.0 Processing A Joint Image-Audio Query

FIG. 3A is a block diagram of an example process 300 for querying a joint image-audio search system. The search system 106 receives the joint image-audio query, comprising image data 302 and audio data 304. This data is received through the network and, in some implementations, the image data 302 is a picture taken of a query object by a user. The audio data 304 includes speech recorded by the user containing information about the query object or about the desired query results. These are paired together as the joint image-audio query.

The audio data 304 includes audio pertaining to speech. The speech data 304 is converted to text data 308 using speech recognition algorithms. The text 308 is further analyzed using natural language processing techniques to parse the content of the text data 308. For example, the image 302 in the joint image-audio query may contain a water bottle, as in FIG. 2A. The audio data 304 accompanying this image may simply be, “water bottle.” The search system 106 converts this speech 304 into text data 308 and uses the text 308 as a search parameter when comparing with resource text data.

Using natural language processing, the search system 106 can determine spatial areas of the image for inclusion or exclusion. For example, the audio 304 may contain the speech, “water bottle in the right of the picture.” The search system 106 converts this speech 304 into text data 308 and parses the statement. The system 106 determines that the right of the picture is an area of interest from the phrase “in the right of the picture,” and thus ignores features and objects recognized in the left of the picture 302 and focuses only on those found on the right.
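A minimal sketch of this kind of positional parsing is shown below; the keyword matching and the quadrant-style crop box are illustrative assumptions for the example, not the natural language processing techniques actually used by the system.

```python
def region_of_interest(query_text, width, height):
    """Return a (left, top, right, bottom) crop box implied by positional phrases.

    Only a few illustrative phrases are handled here; a real system would rely on a
    fuller natural language parser.
    """
    text = query_text.lower()
    if "right" in text:
        return (width // 2, 0, width, height)
    if "left" in text:
        return (0, 0, width // 2, height)
    if "top" in text:
        return (0, 0, width, height // 2)
    if "bottom" in text:
        return (0, height // 2, width, height)
    return (0, 0, width, height)  # no positional cue: keep the whole image

# Example: focus on the right half of a 640x480 query image.
print(region_of_interest("water bottle in the right of the picture", 640, 480))
```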

Using natural language processing, the search system 106 can detect sentiments for particular features or characteristics. For example, the image 302 in the joint image-audio query may contain a red water bottle, as in FIG. 2B. The audio 304, however, may contain the speech, “only blue water bottles, not red.” The search system 106 converts this speech 304 into text data 308 and parses the statement to interpret that the user wants only blue water bottles in the search results, as opposed to the red water bottle in the image 302.

From the image data 302 of the image-audio query, the search system 106 generates image feature value data 306. Image feature value data 306 are value scores that represent visual characteristics of a portion of an image 302. The portion of the image can include the entirety of the image 302, or a sub-portion of the image. In some implementations, the image features 306 can include color, texture, edges, saturation, and other characteristics. Example processes for extracting values of image features 306 from which a feature score can be computed include processes for generating color histograms, texture detection processes (e.g., based on spatial variation in pixel intensities), scale-invariant feature transform, edge detection, corner detection, and geometric blur.
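As one concrete illustration of such a feature extractor, the sketch below computes a normalized per-channel color histogram with NumPy; the bin count and the flattened layout are arbitrary choices for the example, not requirements of the system.

```python
import numpy as np

def color_histogram_features(image, bins=8):
    """Compute a normalized color histogram as an image feature vector.

    image -- H x W x 3 array of RGB values in the range [0, 255]
    bins  -- number of histogram bins per channel
    """
    channel_histograms = []
    for channel in range(3):
        hist, _ = np.histogram(image[:, :, channel], bins=bins, range=(0, 256))
        channel_histograms.append(hist)
    features = np.concatenate(channel_histograms).astype(float)
    return features / features.sum()  # normalize so images of different sizes are comparable

# Example on a random 64x64 RGB image: yields a 24-dimensional feature vector.
example_image = np.random.randint(0, 256, size=(64, 64, 3))
print(color_histogram_features(example_image).shape)
```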

The joint image-audio relevance model 108 receives the image feature data 306 and text data 308. The model 108 also accesses resources 314 in a collection of resources. For each resource 314 accessed, the model 108 generates resource image feature data from the resource image, in a manner similar to the query image 302. The model 108 also determines text data from the resource 314, such as text on a web page that includes the image, or text associated with the image according to a database schema (e.g., a database of commercial products). The model 108 compares the query image feature data with the resource image feature data, and the query text data with the resource text data, and computes a relevance score for that resource 314. The model 108 computes relevance scores for each resource in the collection of resources, ranks the resources according to the scores, and returns a ranked list of the resources 312. The search system 106 then generates search results that reference the images and resources, and provides the search results to the user.

In some implementations, this process may be repeated iteratively one or more times. For example, after producing a list of resources ranked by relevance 312 to the image-audio query 302, 304, the system 106 may use one or more of the highest ranked resource images to run another query. This may produce an improved list of relevant resources. Alternatively or in combination, the system may use resource text data from one or more of the highest ranked resources in addition to or in place of the original query text data 308.
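A minimal sketch of such an iterative refinement follows; the `search` callable and the `.text` attribute on returned resources are assumptions made for the example, not part of this specification.

```python
def iterative_search(search, query_image_features, query_text, rounds=2, top_k=3):
    """Re-run the query, expanding the query text with text from top-ranked resources.

    search -- callable(image_features, text) returning resources ranked by relevance,
              where each resource exposes a `.text` attribute (an assumption here)
    """
    text = query_text
    results = search(query_image_features, text)
    for _ in range(rounds - 1):
        # Borrow text from the highest-ranked resources to supplement the original query.
        expansion = " ".join(resource.text for resource in results[:top_k])
        text = query_text + " " + expansion
        results = search(query_image_features, text)
    return results
```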

§3.0 Training The Joint Image-Audio Relevance Model

For the joint image-audio relevance model 310 to be able to correctly compute relevance scores, it is first trained. FIG. 3B is a block diagram of an example process 350 for training a joint image-audio relevance model 108. The model is trained using annotation pairs. Similar to a joint image-audio query, an annotation pair has image data 352 and associated audio data 354. The set of annotation pairs can be partitioned into a training set and a testing set.

Taking annotation pairs from the training set, image feature data 358 is generated from the annotation image data 352 using similar image processing algorithms as those used on the query image. Text data 360 is determined from the annotation audio data 354 using similar speech recognition and natural language processing techniques as those used on the query audio. A training model 362 receives as input the image feature data 358 and the text data 360. The training model 362 also receives as input a resource 356 with a predetermined relevance to the annotation pair 352, 354. This predetermined relevance may be binary (e.g., relevant/not relevant), on a relative scale (e.g., highly relevant, somewhat relevant, not relevant), or on a scale with more refined values. The model 362 generates resource image feature data from the resource image and determines resource text data from the resource text. Comparing the annotation image feature data 358 to the resource image feature data and the annotation text data 360 to the resource text data, the training model 362 computes a relevance score. Weights that correspond to the image features and text features are adjusted to produce a score in the correct range of the predetermined relevance. This process is repeated for different resources and with different training annotation pairs, all with predetermined relevance values.
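One simple way to picture this weight-adjustment loop is a gradient-style update on a linear scoring model; the sketch below assumes joint feature vectors and numeric relevance labels have already been prepared for each (annotation pair, resource) example, and illustrates the idea rather than the training procedure actually used by the system.

```python
import numpy as np

def train_linear_relevance(examples, epochs=10, learning_rate=0.1):
    """Adjust weights so that w . phi approaches the predetermined relevance label.

    examples -- list of (phi, label) pairs, where phi is the joint image/text feature
                vector for an (annotation pair, resource) example and label is the
                predetermined relevance (e.g., 1.0 for relevant, 0.0 for not relevant)
    """
    dimension = len(examples[0][0])
    weights = np.zeros(dimension)
    for _ in range(epochs):
        for phi, label in examples:
            phi = np.asarray(phi, dtype=float)
            score = weights @ phi
            # Move the weights in the direction that pushes the score toward the label.
            weights += learning_rate * (label - score) * phi
    return weights
```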

The testing set of annotation data may then be used to verify the trained model. The trained model may receive as input annotation pairs from the testing set, along with resources that have predetermined relevance to each of the testing pairs. The testing pairs and resources are processed to generate feature data as done with the training pairs. The model then generates relevance scores for each of these sets of inputs. If the relevance scores are within a threshold range of acceptability, then the model is adequately trained. If, however, the model generates relevance scores that are not within the threshold range of acceptability, then the model is not adequately trained, and the training process may be repeated with the training set of annotation data, with the assigned weights reevaluated and readjusted.

This threshold range may be established in many different ways. For example, each of the qualitative values in the predetermined relevance scale can be assigned a relevance score range. For example, if the relevance scores generated by the model range from 1 to 100, in a binary predetermined relevance scale the threshold may be set at greater than or equal to 50 for relevant and less than 50 for not relevant. Alternatively, the threshold may be made more stringent by assigning, for example, greater than 75 for relevant and less than 25 for not relevant. This may provide for a more effective image-audio relevance model, but may also require more iterations of training to produce. Alternatively, the threshold of acceptability may be more qualitative. For example, for a given annotation pair, there may be a set of resources with a predetermined ranking from more relevant to less relevant. The acceptability of the training of the model may then be evaluated by how closely the trained model comes to reproducing the correct ranking of the resources for the annotation pair.
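A minimal sketch of checking a trained model against such score-range thresholds is shown below, using the stricter 75/25 example above; the callable signature and tuple layout are assumptions of the sketch.

```python
def model_is_acceptable(score_fn, test_examples, relevant_min=75, not_relevant_max=25):
    """Verify that a trained model's scores fall within the acceptability thresholds.

    score_fn      -- callable(annotation_pair, resource) returning a score from 1 to 100
    test_examples -- list of (annotation_pair, resource, is_relevant) from the testing set
    """
    for annotation_pair, resource, is_relevant in test_examples:
        score = score_fn(annotation_pair, resource)
        if is_relevant and score < relevant_min:
            return False
        if not is_relevant and score > not_relevant_max:
            return False
    return True
```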

§3.1 Selection Of Annotation Pair Data

The annotation data may be obtained in a variety of ways. In one implementation, the annotation data is derived from a product database, the product database having a collection of product documents. Each product document has an image of a product and associated text with information regarding the product, such as a description, prices, sellers of the product, and reviews and ratings of both the product and sellers of the product. The annotation pair 352, 354 includes the image from a product document and a subset of the text from the same document. This would also allow for a predetermined relevance between the product document and the annotation pair 352, 354 created from that document. Since the annotation pair 352, 354 was created from that product document, the annotation pair must be highly relevant to the product document.
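A minimal sketch of building annotation pairs from a product database follows; the dictionary keys and the word cutoff are hypothetical, standing in for whatever schema the product documents actually use.

```python
def annotation_pairs_from_products(product_documents, max_text_words=10):
    """Build (image, text) annotation pairs from product documents.

    product_documents -- iterable of dicts with 'image' and 'description' keys
                         (a hypothetical schema for the product database)
    Because each pair is derived from its own product document, that document can be
    used as a training resource with known high relevance to the pair.
    """
    pairs = []
    for document in product_documents:
        text = " ".join(document["description"].split()[:max_text_words])
        pairs.append({"image": document["image"],
                      "text": text,
                      "relevant_resource": document})
    return pairs
```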

In another implementation, the annotation data is derived from selection data from image search result data. Query input text entered by users into an image search system may be used as the annotation text data 354 of an annotation pair. The annotation image data 352 for the pair may be chosen from images that are the most popular results from the image search corresponding to the query input. The popularity of results may be determined by statistical measures such as click-through rate. Alternatively, the annotation data may be from selection data from product search result data. The query input can again be used as the annotation text data 354 for an annotation pair. The annotation image 352 may be obtained from the product image of the most popular product documents selected by users for that query input. This would also provide product documents to use as resources with high predetermined relevance.

In another implementation, the annotation data is derived from selection data from general web search result data. Query input text entered by users into a web search system may be used as the annotation text data 354 of an annotation pair. The web search system may return general web resources, including websites, images, and product documents. If the user selects a product document as a result of the web search, the product image may be used as the annotation image data 352 for the annotation pair. The product document is then used as the resource with known high relevance.

In another implementation, human annotators may be used to provide training data. The annotators may take a photograph to provide the annotation image 352, and provide speech or text data for the annotation text data 354 describing resources they wish to search for. The annotators may then search through a product document or other resource database and find resources that are both related and unrelated to the photograph and speech data they provided. For each resource they find, the annotators can then label it as a good quality match or a poor quality match. In another implementation, the annotators may be used to rate the quality of matches determined through an automated procedure. For example, any of the previously discussed procedures may be used to obtain annotation data from a product database, product search selection data, image search selection data, or web search selection data, and human annotators may rate the relevance of each annotation pair with the resource selected by the automated process.

§3.2 Example Scoring Models

A variety of models can be used to realize the relevance function f(S, I, R), and examples are described below. One example model implements a relevance function that is a linear combination of constituent models trained on image feature data and audio and text data, i.e.,

f(S, I, R) = c·f_s(S, R) + (1 − c)·f_I(I, R)

where f_s is a scoring function trained on the speech and text data, and f_I is a scoring function trained on the image feature data. The mixing parameter c is a value that is adjusted between 0 and 1.
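A minimal sketch of this linear combination, assuming the constituent scoring functions f_s and f_I are supplied as callables (names are illustrative):

```python
def linear_combination_relevance(f_s, f_I, c):
    """Build f(S, I, R) = c * f_s(S, R) + (1 - c) * f_I(I, R).

    f_s -- scoring function trained on the speech and text data
    f_I -- scoring function trained on the image feature data
    c   -- mixing parameter between 0 and 1
    """
    def f(S, I, R):
        return c * f_s(S, R) + (1 - c) * f_I(I, R)
    return f
```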

Another example model implements a relevance function f(S, I, R) that restricts the set of resource items considered to only those with textual descriptions that contain the words in S. Using this restricted set, the model then scores on the relevance of the image feature data. Thus, the relevance function would be

f(S, I, R) = f_I(I, R)·f(S, R)

where f(S, R) = 1 if the text S is in the resource R, and f(S, R) = 0 otherwise.
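A minimal sketch of this restriction, assuming each resource exposes its text as `R.text` (an assumption made for the example):

```python
def text_restricted_relevance(f_I):
    """Build f(S, I, R) = f_I(I, R) * f(S, R), where f(S, R) is 1 only when every
    word of S appears in the resource text, and 0 otherwise."""
    def f(S, I, R):
        resource_text = R.text.lower()
        contains_all_words = all(word in resource_text for word in S.lower().split())
        return f_I(I, R) * (1.0 if contains_all_words else 0.0)
    return f
```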

Another example model implements a relevance function f(S, I, R) where an image feature relevance function is learned for every possible choice of S, i.e.,

f(S, I, R) = W_S · Φ(I, R)

where Φ(I, R) is the feature representation of the image and resource, and W_S is a learned feature weight matrix of features representing images and resources. W_S is a 1 × |Φ(I, R)| matrix, or a vector of dimension |Φ(I, R)|, that is, the number of features that are used to represent the image and resource.
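A minimal sketch of scoring with such a per-query weight vector, assuming the weight vectors have already been learned and stored by query text (illustrative names only):

```python
import numpy as np

def per_query_relevance(weights_by_query, phi):
    """Build f(S, I, R) = W_S . Phi(I, R), with a separate weight vector for each query text S.

    weights_by_query -- dict mapping query text S to a weight vector of dimension |Phi(I, R)|
    phi              -- callable(I, R) returning the joint image/resource feature vector
    """
    def f(S, I, R):
        w = np.asarray(weights_by_query[S], dtype=float)
        return float(w @ np.asarray(phi(I, R), dtype=float))
    return f
```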

Yet another example model implements a relevance function using one linear ranking function, i.e.,

f(S, I, R) = W · Φ(S, I, R).

Another example model implements a relevance function that is an extension of the approach found in the paper, “Large Scale Image Annotation: Learning to Rank with Joint Word-Image Embeddings,” by Jason Weston, Samy Bengio, and Nicolas Usunier (the “Weston paper”), incorporated herein by reference. The approach in the Weston paper involves training on an “embedding space” representation of arbitrary dimension, where the distance between two items in the space denotes their similarity. This model involves the function

f(S, I, R) = (W_SI · Φ_SI(S, I)) · (W_R · Φ_R(R))

where W_SI and W_R are matrices, and the method learns both matrices and an embedding space of dimension R that is typically low dimensional. W_SI is an R × |Φ_SI(S, I)| matrix, where R is the dimension of the embedding space and |Φ_SI(S, I)| is the number of features used to represent the text and image jointly. W_R is an R × |Φ_R(R)| matrix, where |Φ_R(R)| is the number of features used to represent the resource. The embedding space in the Weston paper used only images and labels. The approach is extended here by concatenating the speech and image features in a single feature space Φ_SI(S, I).
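A minimal sketch of scoring with this joint embedding, assuming the matrices W_SI and W_R and the feature maps are already available (illustrative names only, not the training procedure):

```python
import numpy as np

def embedding_relevance(W_SI, W_R, phi_si, phi_r):
    """Build f(S, I, R) = (W_SI . Phi_SI(S, I)) . (W_R . Phi_R(R)).

    W_SI   -- matrix mapping joint speech/image features into the embedding space
    W_R    -- matrix mapping resource features into the same embedding space
    phi_si -- callable(S, I) returning the joint speech/image feature vector
    phi_r  -- callable(R) returning the resource feature vector
    """
    def f(S, I, R):
        query_embedding = np.asarray(W_SI, dtype=float) @ np.asarray(phi_si(S, I), dtype=float)
        resource_embedding = np.asarray(W_R, dtype=float) @ np.asarray(phi_r(R), dtype=float)
        # Relevance is the dot product of the two points in the shared embedding space.
        return float(query_embedding @ resource_embedding)
    return f
```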

Another example model implements a relevance function that further extends the Weston paper approach. The relevance function is defined by

f(S, I, R) = Σ (W_S · Φ_S(S)) * (W_I · Φ_I(I)) * (W_R · Φ_R(R))

where the * operation is the component-wise multiplication of vectors. This function allows for more complex nonlinear interactions between the features of the image, speech, and resource.
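A minimal sketch of this component-wise variant, under the same assumptions as the previous sketch (learned matrices and feature maps supplied by the caller):

```python
import numpy as np

def componentwise_relevance(W_S, W_I, W_R, phi_s, phi_i, phi_r):
    """Build f(S, I, R) = sum of (W_S . Phi_S(S)) * (W_I . Phi_I(I)) * (W_R . Phi_R(R)),
    where * is the component-wise product of the three embedded vectors."""
    def f(S, I, R):
        speech = np.asarray(W_S, dtype=float) @ np.asarray(phi_s(S), dtype=float)
        image = np.asarray(W_I, dtype=float) @ np.asarray(phi_i(I), dtype=float)
        resource = np.asarray(W_R, dtype=float) @ np.asarray(phi_r(R), dtype=float)
        return float(np.sum(speech * image * resource))
    return f
```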

§4.0 Example Processes

FIG. 4 is a flow chart of an example process 400 for training a joint image-audio relevance model 108. The process 400 can be implemented in the search system 106 and is used to train a joint image-audio relevance model 108.

The process 400 accesses image annotation data (402). For example, the search system 106 accesses image annotation data from a product database. The search system 106 may also access image annotation data from product search selection data. In another example, the search system 106 accesses image annotation data from image search selection data. In another implementation, the search system 106 accesses image annotation data from web search selection data. The search system 106 may also access image annotation data from data annotated by human annotators. The human annotators may create their own image and speech data to annotate, or may access data to annotate from a product database or another automated process.

The process 400 accesses resources (404). For example, the search system 106 accesses resources comprising product documents from a product database.

The process 400 trains a joint image-audio relevance model on the image annotation data and resources (406). For example, the search system 106 trains a joint image-audio relevance model using the image annotation data from the product database and the resources from the product database. The joint image-audio relevance model can, for example, be trained according to any of the training algorithms described in section 3.2 above, or other training algorithms can be used.

FIG. 5 shows a flow chart of an example process 500 for ranking resources for a joint image-audio query. The process 500 can be implemented in the search system 106 and is used to rank resources for a joint image-audio query.

The process 500 receives a joint image-audio query (502). For example, the search system 106 receives a joint image-audio query from a user device through the network.

The process 500 determines query image feature data (504). For example, the search system 106 generates image feature value data from the query image received from the user device.

The process 500 determines query audio feature data (506). For example, the search system 106 processes the audio data to generate text data from audio data comprising speech data.

The process 500 provides query image feature data and query audio feature data to the joint image-audio relevance model (508). For example, the search system 106 provides query image feature data and text data to the joint image-audio relevance model. The joint image-audio relevance model is trained to generate relevance scores for a collection of resources.

The process 500 orders resources according to their relevance scores (510). For example, the search system 106 orders the resources from most relevant to least relevant to the image-audio query.

The process 500 provides search results indicating the order of the resources (512). For example, the search system 106 provides search results comprising a list of resource addresses, ranked from most relevant to least relevant, to the user device.

§5.0 Additional Implementation Details

Implementations of the subject matter and the operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a computer storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, for example, a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Moreover, while a computer storage medium is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially generated propagated signal. The computer storage medium can also be, or be included in, one or more separate physical components or media (for example, multiple CDs, disks, or other storage devices).

The operations described in this specification can be implemented as operations performed by a data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.

The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, a system on a chip, or multiple ones, or combinations, of the foregoing. The apparatus can include special purpose logic circuitry, for example, an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, for example, code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them. The apparatus and execution environment can realize various different computing model infrastructures, such as web services, distributed computing and grid computing infrastructures.

A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (for example, one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (for example, files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, for example, an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, for example, magnetic, magneto-optical, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, for example, a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (for example, a universal serial bus (USB) flash drive), to name just a few. Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, for example, EPROM, EEPROM, and flash memory devices; magnetic disks, for example, internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, implementations of the subject matter described in this specification can be implemented on a computer having a display device, for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, for example, a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, for example, visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any disclosures or of what may be claimed, but rather as descriptions of features specific to particular implementations of particular disclosures. Certain features that are described in this specification in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Thus, particular implementations of the subject matter have been described. Other implementations are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

What is claimed is:
 1. (canceled)
2. A computer-implemented method performed by a data processing apparatus, the method comprising: receiving a joint image-audio query from a client device, the joint image-audio query including query image data defining a query image and query audio data defining query audio; deriving text from the query audio data; determining, from the text, positional information that describes a position of a query object in the query image data, the position being a subset of the query image and in which the query object is depicted; determining query image feature data from the query image data, the query image feature data describing image features of the subset of the query image; providing the query image feature data and the text data to a relevance model as input; ordering, by the data processing apparatus, the identified resources according to corresponding relevance scores generated by the relevance model; and providing, by the data processing apparatus, data defining search results indicating the order of the identified resources to the client device.
3. The computer-implemented method of claim 2, wherein the query image data and the query audio data were paired as the joint image-audio query at the client device.
4. The computer-implemented method of claim 2, wherein the relevance model is a joint image-audio relevance model that is trained to generate the relevance scores for a plurality of resources based on a combined relevance of the query image feature data to image feature data of the resource and the text derived from the query audio to text of the resource.
5. The computer-implemented method of claim 4, wherein the joint image-audio relevance model generates relevance scores based, in part, from only image feature data determined from the subset of the image data.
6. The computer-implemented method of claim 2, wherein the relevance model generates relevance scores based, in part, on the positional information.
7. The computer-implemented method of claim 2, wherein the text further defines one or more restrictions on the search results, and the relevance model generates relevance scores based, in part, on the one or more restrictions.
8. A system, comprising: a data processing apparatus; and a computer storage medium encoded with a computer program, the program comprising instructions that when executed by the data processing apparatus cause the data processing apparatus to perform operations comprising: receiving a joint image-audio query from a client device, the joint image-audio query including query image data defining a query image and query audio data defining query audio; deriving text from the query audio data; determining, from the text, positional information that describes a position of a query object in the query image data, the position being a subset of the query image and in which the query object is depicted; determining query image feature data from the query image data, the query image feature data describing image features of the subset of the query image; providing the query image feature data and the text data to a relevance model as input; ordering, by the data processing apparatus, the identified resources according to corresponding relevance scores generated by the relevance model; and providing, by the data processing apparatus, data defining search results indicating the order of the identified resources to the client device.
9. The system of claim 8, wherein the query image data and the query audio data were paired as the joint image-audio query at the client device.
10. The system of claim 8, wherein the relevance model is a joint image-audio relevance model that is trained to generate the relevance scores for a plurality of resources based on a combined relevance of the query image feature data to image feature data of the resource and the text derived from the query audio to text of the resource.
11. The system of claim 10, wherein the joint image-audio relevance model generates relevance scores based, in part, from only image feature data determined from the subset of the image data.
12. The system of claim 8, wherein the relevance model generates relevance scores based, in part, on the positional information.
13. The system of claim 8, wherein the text further defines one or more restrictions on the search results, and the relevance model generates relevance scores based, in part, on the one or more restrictions.
14. A computer storage device encoded with a computer program, the program comprising instructions that when executed by a client device cause the client device to perform operations comprising: receiving a joint image-audio query from a client device, the joint image-audio query including query image data defining a query image and query audio data defining query audio; deriving text from the query audio data; determining, from the text, positional information that describes a position of a query object in the query image data, the position being a subset of the query image and in which the query object is depicted; determining query image feature data from the query image data, the query image feature data describing image features of the subset of the query image; providing the query image feature data and the text data to a relevance model as input; ordering, by the data processing apparatus, the identified resources according to corresponding relevance scores generated by the relevance model; and providing, by the data processing apparatus, data defining search results indicating the order of the identified resources to the client device.
15. The computer storage device of claim 14, wherein the query image data and the query audio data were paired as the joint image-audio query at the client device.
16. The computer storage device of claim 14, wherein the relevance model is a joint image-audio relevance model that is trained to generate the relevance scores for a plurality of resources based on a combined relevance of the query image feature data to image feature data of the resource and the text derived from the query audio to text of the resource.
17. The computer storage device of claim 16, wherein the joint image-audio relevance model generates relevance scores based, in part, from only image feature data determined from the subset of the image data.
18. The computer storage device of claim 14, wherein the relevance model generates relevance scores based, in part, on the positional information.
19. The computer storage device of claim 14, wherein the text further defines one or more restrictions on the search results, and the relevance model generates relevance scores based, in part, on the one or more restrictions.